On the reliability and the limits of inference of amino acid sequence alignments

Abstract Motivation Alignments are correspondences between sequences. How reliable are alignments of amino acid sequences of proteins, and what inferences about protein relationships can be drawn? Using techniques not previously applied to these questions, by weighting every possible sequence alignment by its posterior probability we derive a formal mathematical expectation, and develop an efficient algorithm for computation of the distance between alternative alignments allowing quantitative comparisons of sequence-based alignments with corresponding reference structure alignments. Results By analyzing the sequences and structures of 1 million protein domain pairs, we report the variation of the expected distance between sequence-based and structure-based alignments, as a function of (Markov time of) sequence divergence. Our results clearly demarcate the ‘daylight’, ‘twilight’ and ‘midnight’ zones for interpreting residue–residue correspondences from sequence information alone. Supplementary information Supplementary data are available at Bioinformatics online.


S1 Derivation of EAD recurrences given in the main text Section 3.2
Beyond the notations introduced in section 3.2 of the main text, the derivation below uses the following additional ones: A the set of all possible alignments between S, T . A (i,j) the set of all possible alignments of their prefixes S 1...i , T 1...j of the sequences. A m (i,j) the subset of all alignments of prefixes that end in a match(m) state at cell (i, j). A i (i,j) the subset of all alignments of prefixes that end in a insert(i) state at cell (i, j). A d (i,j) the subset of all alignments of prefixes that end in a delete(d) state at cell (i, j). A m (i,j) any alignment of prefixes that ends in a match(m) state at (i, j). A i (i,j) any alignment of prefixes that ends in a insert(i) state at (i, j). A d (i,j) any alignment of prefixes that ends in a delete(d) state at (i, j). A m|m (i,j) any alignment of prefixes that ends in a match(m) state at (i, j) given a match(m) state at (i − 1, j − 1). (Similar notation for all 9 possible transitions going between any two states of {match, insert, delete}.) Pr(m|m) the transition probability of going into a match given a previous match state.
(Similar notation for all 9 possible transitions going between any two states of {match, insert, delete}.) Pr( s i , t j ) the joint probability of matching a pair of amino acids, s i ∈ S and t j ∈ T .

Derivation
Starting with recurrence (6) in the main text, by the definition of EAD m (i, j), we have: But all alignments A m (i,j) that end in a match (m) state at (i, j) are derived by extending all alignments arriving at the cell (i − 1, j − 1) in any of the three alignment states ({match, insert, delete}), that is the set of alignments , by a pair of matched amino acids corresponding to the cell (i, j), that is, s i , t j . Figure SF1. All alignments ending in a match state at (i, j) cross two skew-diagonals, along which their additional distances has to be accounted for during dynamic programming Therefore, Equation 1, can be decomposed based on the above observation as: where the component joint probability terms in the r.h.s of Equation 2 are equivalent to: Further, the component distance terms in the r.h.s of Equation 2 can be expanded as: This holds because any alignment ending in a match state at (i, j) must arrive from (i − 1, j − 1), and in doing so, will cross two skew-diagonals (see Figure SF1. Also cf. Figure 1 in the main text) 1. one skew-diagonal indexed by i + j − 1 and 2. the other skew-diagonal indexed by i + j.
Thus, for all alignments going from (i − 1, j − 1) to (i, j), the component distance terms at (i − 1, j − 1) get augmented by a δ(i + j − 1) + δ(i + j), accounting for their widths/slacks with respect to the reference alignment A ref along the above two skew-diagonals.
Substituting the expanding component joint probability and distance terms shown above into Equation 2, after rearranging yields: By grouping all even terms on the r.h.s. of Equation 3 together, we get the recurrence: But the last term on the r.h.s. is the marginal probability over all alignments ending in a match at (i, j), resulting in the final form of recurrence (6) used in the main text: Recurrences (7) and (8) in the main text follow identical lines of derivations, with the only difference that they account for all alignments coming into (i, j) in a insert(i) and delete(d) states, respectively. Also, all such alignment transitions only cross a single skew-diagonal, indexed by i + j, therefore those recurrences will contain only the δ(i + j) term.

S2.1 Set of million domain pairs
In this work we randomly sampled a million domain pairs from the Structural Classification of Proteins (SCOPe v2.07) database [Murzin et al., 1995] using the random sampling method described below. The full list of scop domain pairs (including the SCOP domain identifiers and the SCOP classification level) can be downloaded from here.

S2.1.1 Random sampling method
A domain pair is randomly selected by utilising the SCOP organisation of domains within its hierarchical classification tree. The internal nodes of this tree are associated with the four-level classification of protein domains: class, fold, superfamily and family. Each domain is uniquely represented by a leaf node. A traversal from the root node to a leaf node yields a domain. The sampling procedure involves traversing from the root to a leaf node, passing each of the SCOP levels: class, fold, superfamily and family in order. At each node of this traversal, until a leaf node (domain) is reached, a child node is selected from the available children (i.e. nodes in the level below the current node), by sampling randomly based on the weights (i.e. number of leaves) of their respective subtrees. Thus, to identify domain pairs within the same superfamily but under different families, the traversal first proceeds from the root to the superfamily level. Then the weighted random sampling method selects two random domains (leaves) from two different families (child nodes). To identify pairs from the same family, a pair of its children (leaves) are randomly selected, when the traversal reaches family level nodes, while considering families with more than 2 domains.

S2.2 Five sets of domain pairs sampled at varying levels of SCOP hierarchy
Using the random sampling method described above, we further sampled sets of unique domain pairs from each hierarchical level of SCOP such that each domain appears at most once in the dataset. This comprised 5 sets of domain pairs sampled at the same family, same superfamily, same fold, same class, and decoy (different class) levels respectively. See Table ST1 for more information on the datasets.   , 1992], PAM [Dayhoff et al., 1978], MML-SUM [Sumanaweera et al., 2020] and VTML [Müller et al., 2002] and the 3 reference structure alignment programs: TM-align [Zhang and Skolnick, 2005], MMLigner [Collier et al., 2017] and DALI [Holm and Sander, 1995]. Each of these 12 possible combinations generated a million data points respectively. For each comparison, these million data points were grouped into bins in the range [1, 500] based on the inferred Markov time (time marginal ) (bottom x-axes of the plots in Fig. SF2) . Then, for each bin the first (Q1), second (Q2) and third (Q3) quartile statistics of the excepted inter-alignment distances were computed. Fig. SF2 shows the variation of the expected inter-alignmet distance versus time marginal for all the 12 possible combinations. The brown curve in the Fig. SF2 shows the % cumulative growth whose vertical scale appears on the right side of each plot. The top x-axes of the plots shows the expected %-change of amino acids in the range of [1%, ∼ 92%] which corresponds to the time-range [1, 500] for each time-parameterized models which is available here (see Sumanaweera et al. [2019] for more details).

S3.1.1 Canonicalization of reference alignments
For the computation of expected inter-alignment distance, the insert and delete blocks of all the reference (structure) alignments produced by the 3 reference alignment programs (MMLigner, TM-Align, and DALI) were converted to D * I * form as shown in the example below.
Consider an alignment between two sequences S and T : Note that the DALI structure alignment program resulted only 603,790 valid alignments out of the million domain pairs. It failed to produce an alignment relationship for some domain pairs containing multiple chains and for some domain pairs, it changed the amino acid residues in either or both of the sequences reported in the alignment corresponding to the PDB files. Therefore, these alignments were ignored in the analysis.

S3.2 Variation of the expected inter-alignment distance as a function of Markov time parameter after separating the million domain-pairs into distinct groups based on their compression statistics
We grouped the million domain pairs into three subsets based on the respective statistical significance test statistics (∆ optimal and ∆ marginal ).
1. The first group accounts for the 417, 084 domain pairs whose optimal sequence alignment under the MML framework yielded positive compression (i.e., ∆ optimal > 0). 2. The second group accounts for 273, 944 domain pairs whose optimal sequence alignment lost to the null model but gave positive compression using the marginal model (i.e., ∆ optimal ≤ 0 < ∆ marginal ). 3. The final group contains the remaining 308, 972 domain pairs where ∆ marginal ≤ 0. Figure SF3 shows the variation of the expected inter-alignment distance statistic as a function of the inferred Markov time parameter using MMLSUM time-parameterized model with respect to the reference structure alignments reported by MMLigner for the above distinct groups of domain pairs. (see supplementary website for the remaining combinations.) S3.3 Variation of the empirical inter-alignment distance as a function of Markov time parameter S3.3.1 Random sampling approach used to generate highly-probable alignments For any two sequences S and T an alignment can be sampled as a three-state string of match, insert, delete states based on a Roulette wheel selection.
Here the objective is to give every state a chance of being selected, but to make more probable states at a certain position in the alignment string more likely to be chosen over the other states. This is achieved by allocating a section in an imaginary roulette wheel where the sections are of different sizes, proportional to the marginal probability of the individual states.
Running seqMMLigner in its marginal mode generates three matrices (one for each state) containing marginal probabilities of the two sequences [Sumanaweera et al., 2019]. Given the marginal probability matrices of the two sequences S and T : T otm, T oti, and T otd, we create an imaginary roulette wheel for the (i, j) th position of the alignment string. Here, the size of the section of each state is defined using the marginal probability of the three states at the given position, i.e., T otm(i, j), T oti(i, j) and T otd(i, j). Then we spin the wheel and select the state associated with the winning section. If the selected state is m, a new imaginary roulette wheel is created and spun to select the next state in the sampling method (previous state in the alignment) at (i−1, j−1) th position. If the i state is selected, the next roulette wheel selection is done on the (i−1, j) th position. Similarly, for the d state, the next roulette wheel selection is done at the position (i, j−1). Starting from the (|S|, |T |) th position, this process is continued until the (0, 0) th position to generate a three-state string.
We continued this process multiple times to generate 1,000 randomly sampled three-state strings to cross-verify the computation of the mathematical expectation of distance, using an empirical (sampling) approach for million domain pairs. Then, the distance between each sampled alignment for a pair was compared to its corresponding reference structure alignment. The resulting estimates of distance were then averaged to generate an empirical estimate of the distance that can be compared against the computation of the mathematical expectation of inter-alignment distance. Fig. SF5 shows the variation of the empirical inter-alignment distance as a function of the inferred Markov time parameter, time marginal for the 12 combinations discussed previously.

S3.4 Comparsion of Expected inter-alignment distance and E-value
Consider two random sequences with lengths m and n respectively. Given the distribution of individual amino acid residues, and a scoring matrix, the number of distinct local alignments with a score S is said to be Poisson distributed with a mean statistic of: Figure SF4. First, second and third quartile statistics of the empirical inter-alignment distance with respect to a given reference alignment as a function of the inferred Markov time parameter, timemarginal. The legends of the plots mirrors that of Figure SF2 where λ and K can be calculated from the scoring matrix and the average sequence compositions based on the Poisson distribution. This is known as the E-value which gives the expected number of distinct optimal local alignments with a score of at least S [Karlin and Altschul, 1990, Altschul et al., 2001.
A subsequent study by Levitt and Gerstein [1998] have compared sequence alignment significance against structure alignment significance using the resultant E-values, and identified a statistical significance threshold of 1% which agrees for both types of alignments. With this work, they substantiated E-value as a common ground to evaluate alignment scores across different programs.
Here we computed the E-value for the million structure alignments generated by TM-Align program using the BLOSUM62 matrix and the associated default parameters used in Blast. Figure SF6 shows the variation of the E-value as a function of the inferred Markov time parameter (time marginal ) for the million domain pairs. Note that the left y-axis is in natural logarithmic scale. Figure SF5. First, second and third quartile statistics of the natural logarithm of E-value with respect to a reference alignment obtained using TM-Align program as a function of the inferred Markov time parameter using BLOSUM model.
In Figure SF6, the median of E-value falls below 0.01 in the time range [1 − ∼140] which corresponds to a expected % change < ∼70%. This is in agreement with the common wisdom where an E-value < 0.01 is considered a good hit for homology matches. 1

S4.1 Statistics of the 5 domain pairs sets
We randomly sampled sets of unique domain pairs from each hierarchical level of SCOP such that each domain appears at most once in the dataset. This comprised 5 sets of domain pairs sampled at the same family, superfamily, fold, class, and decoy levels respectively. See Table ST1 for more information on the datasets.
Next we computed the expected inter-alignment distance and the inferred Markov time parameter (time marginal ) for each set of domain pairs over all possible combinations of the 3 time-parameterized models: BLOSUM [Henikoff and Henikoff, 1992], MMLSUM [Sumanaweera et al., 2020] and VTML [Müller et al., 2002] and the 2 reference structure alignment programs: TM-align [Zhang and Skolnick, 2005], and MMLigner [Collier et al., 2017]. Each of these 6 possible combinations generated sets of data points respectively. The same set of steps were carried out to compute the first (Q1), second (Q2) and third (Q3) quartile statistics of the excepted inter-alignment distances. Fig. SF7 shows the variation of the expected inter-alignmet distance versus time marginal for the MMLSUM time parameterized model with respect to reference structure alignments reported by MMLigner.